Supervised Learning
Given training data D = \{(x, t)\} with input examples (x) and desired outputs (t), the goal is to find an approximation of an unknown function f that can generalize to new data.
Supervised learning is useful when:
- No human expert exists (e.g., DNA sequence analysis)
- The task is performable but not explicable (e.g., image recognition)
- The underlying function changes frequently (e.g., stock price prediction)
- Personalization is needed for each user (e.g., spam filters)
To define a supervised learning problem, we need:
- Loss Function L(t, y(x)): A measure of error between the true target t and the model’s prediction y(x)
- Hypothesis Space (H): the set of candidate functions we search within to find the best model
- Optimization Algorithm: the procedure to improve the hypothesis by minimizing the loss function
If f is known, the problem is function approximation. When f is unknown, we estimate it from data using a loss function that accounts for noise.
Model Evaluation
A model is evaluated based on its expected loss, which is the average loss over all possible data:
\mathbb{E}[L] = \int \int L(t, y(x)) \, p(x, t) dx dt
where:
- L(t, y(x)): loss function measuring error between true target and prediction
- p(x, t): true joint distribution of inputs and targets (probability of seeing a particular input-target pair)
Knowing p(x, t) would be equivalent to knowing f (the generating function). The joint distribution is approximated from training data.
The model y(x) that minimizes the expected loss is the conditional mean computed as: y^*(x) = \mathbb{E}[t|x] = \int t \, p(t|x) \, dt
Loss Functions
A common loss function is the Minkowski loss: L(t, y(x)) = |t - y(x)|^q
where q is a parameter that controls how errors are penalized.
- For q=2 (squared loss), large errors are penalized more heavily.
- For q=1 (absolute loss), the model is more robust to outliers.
- For q=\infty (max loss), only the worst error matters.
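A minimal sketch comparing the three cases (example values are hypothetical):

```python
import numpy as np

def minkowski_loss(t, y, q):
    """Minkowski loss |t - y|^q, summed over samples (max for q = infinity)."""
    err = np.abs(np.asarray(t) - np.asarray(y))
    if np.isinf(q):
        return err.max()  # only the worst error matters
    return np.sum(err ** q)

t = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 5.0])
print(minkowski_loss(t, y, 1))       # 2.5: absolute errors summed
print(minkowski_loss(t, y, 2))       # 4.25: the large error (2.0) dominates
print(minkowski_loss(t, y, np.inf))  # 2.0: only the worst error counts
```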
Bias-Variance
- Bias: Error due to hypothesis space H being too restrictive to contain the true function. When |H| is small, bias is high.
- Variance: Error due to overfitting to training data. When |H| is large, variance is high.
The tradeoff:
- Small H → high bias, low variance (underfitting)
- Large H → low bias, high variance (overfitting)
More training samples reduce variance for a given model complexity, since the fit becomes less sensitive to the noise in any individual sample.
Approaches to Supervised Learning
To solve supervised learning problems, there are three main approaches:
Generative Approach
The generative approach models how the data was generated by learning the joint distribution p(x, t) = p(t|x)p(x).
Once the model is learned, it’s possible to:
- Generate new synthetic data
- Infer the conditional density p(t|x) = \frac{p(x,t)}{p(x)} for predictions
This approach is powerful but often requires modeling complex distributions, which can be difficult and computationally expensive.
Discriminative Approach
The discriminative approach focuses directly on modeling the relationship between inputs and outputs by learning the conditional distribution p(t|x), without modeling the input distribution p(x).
Predictions use the conditional mean: \mathbb{E}[t|x] = \int t \, p(t|x) \, dt
This approach is more efficient for prediction tasks, as it doesn’t require modeling the full data distribution.
Direct Approach
The direct approach avoids probability modeling. Instead of learning distributions, it directly minimizes a loss function on the training data to find the best mapping from inputs to targets.
For prediction tasks, you just need the mapping x \to t to be accurate.
This approach is often more straightforward and computationally efficient.
Linear Regression
The goal is to learn a function that maps input features x to target output t. Linear regression models assume this relationship is linear in the parameters (though features can be nonlinearly transformed).
The solution can be found analytically.
y(x, w) = w_0 + \sum_{j=1}^{D-1} w_j x_j = w^T \phi(x)
where:
- \phi(x) = (1, x_1, \dots, x_{D-1}): augmented feature vector
- w = (w_0, w_1, \dots, w_{D-1}): weight vector
- w_0: bias term
Linear models offer several advantages:
- Easy to optimize: Convex loss function guarantees global optimum
- Easy to interpret: Weights directly show feature importance
- Multiple outputs: Can be extended to predict multiple targets simultaneously as each output is a linear combination of the same features
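A tiny sketch of evaluating the model (weights and inputs are hypothetical):

```python
import numpy as np

def phi(x):
    """Augmented feature vector (1, x_1, ..., x_{D-1}); here just identity features."""
    return np.concatenate(([1.0], x))

w = np.array([0.5, 2.0, -1.0])   # (w_0, w_1, w_2): bias plus one weight per feature
x = np.array([3.0, 1.0])
y = w @ phi(x)                   # w_0 + w_1*x_1 + w_2*x_2
print(y)                         # 0.5 + 6.0 - 1.0 = 5.5
```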
Ordinary Least Squares (OLS)
Ordinary Least Squares (OLS) is a direct method that finds the weights minimizing the error on the training data.
Since we cannot compute the true expected loss (the joint distribution is unknown), we approximate it with the empirical loss computed from the N training samples:
L(w) = \frac{1}{2} \sum_{n=1}^N (y(x_n, w) - t_n)^2
- The factor \frac{1}{2} is included purely for computational convenience: it cancels the 2 produced when differentiating the squared term.
Up to the \frac{1}{2} factor, this is the residual sum of squares (RSS), also called sum of squared errors (SSE):
RSS(w) = \|\epsilon\|_2^2 = \sum_{n=1}^N \epsilon_n^2
where:
- \epsilon_n = y(x_n, w) - t_n is the residual error for the n-th training example.
The p-norm of the residual vector generalizes the loss function, mapping a vector of errors to a single scalar measure.
- For p=1, we get the L1 norm (sum of absolute errors), whose unit ball is diamond-shaped.
- For p=2, we get the L2 norm (whose square is the sum of squared errors), with a circular unit ball.
- For p=\infty, we get the L-infinity norm (maximum error), with a square-shaped unit ball.
Matrix Formulation:
L(w) = \frac{1}{2} \text{RSS}(w) = \frac{1}{2} (t - \Phi w)^T (t - \Phi w)
where:
- t = [t_1, \dots, t_N]^T: vector of target values (N-dimensional)
- \Phi = [\phi(x_1), \dots, \phi(x_N)]^T: matrix with samples as rows, features as columns (N \times M)
- w: weight/parameter vector (M-dimensional)
Solution
The solution is found by setting the gradient of the loss to zero:
\frac{\partial L(w)}{\partial w} = - \Phi^T (t - \Phi w) = 0
The Hessian (second derivative) is: \frac{\partial^2 L(w)}{\partial w \partial w^T} = \Phi^T \Phi
This is positive definite if \Phi has full column rank, ensuring a unique global minimum.
The point where the gradient is zero is the minimum of the loss function; at that point the residuals are uncorrelated with the features. This means the model has captured all the linear relationships in the data, and any remaining error is due to noise.
Solving the gradient for w: \hat{w}_{OLS} = (\Phi^T \Phi)^{-1} \Phi^T t
This is the Ordinary Least Squares (OLS) solution, which gives the best linear fit to the training data in terms of minimizing the sum of squared errors.
To work properly, OLS requires:
- More samples than features: N \geq M
- No redundant features: Features must be linearly independent, otherwise the matrix is singular and cannot be inverted.
- Computational budget: Inversion costs O(M^3)
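A minimal numpy sketch of the OLS solution on synthetic data (the generating function and noise level are hypothetical); it also checks the zero-correlation property of the residuals. np.linalg.solve is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 100 samples, targets t = 1 + 2*x + Gaussian noise.
N = 100
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, N)

# Design matrix Phi: a column of ones for the bias, then the feature.
Phi = np.column_stack([np.ones(N), x])

# Normal equations: w_hat = (Phi^T Phi)^{-1} Phi^T t.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_hat)  # close to [1.0, 2.0]

residuals = t - Phi @ w_hat
print(Phi.T @ residuals)  # ~0: residuals are uncorrelated with the features
```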
Linear Model Evaluation
To evaluate the performance of a linear regression model, we use the Root Mean Squared Error (RMSE), which is derived from the training loss (equivalently, from the residual sum of squares):
E_{\text{RMS}} = \sqrt{\frac{2 \, L(\hat{w})}{N}} = \sqrt{\frac{\text{RSS}(\hat{w})}{N}}
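As a small numerical sketch (the residuals are hypothetical), the RMSE is the square root of the mean squared residual:

```python
import numpy as np

residuals = np.array([0.5, -0.3, 0.2, -0.4])   # hypothetical residuals t_n - y(x_n)
N = len(residuals)
L = 0.5 * np.sum(residuals**2)                 # training loss with the 1/2 factor
rmse = np.sqrt(2 * L / N)                      # E_RMS = sqrt(2*L/N)
print(rmse)                                    # about 0.367
```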
Variance
To estimate the variance of the noise, we can use the residuals from the fitted model: \hat{\sigma}^2 = \frac{1}{N - M} \sum_{n=1}^N (t_n - \hat{w}^T \phi(x_n))^2
where:
- N: number of samples
- M: number of parameters (features)
More samples reduce the variance of the noise estimate, while more parameters increase it.
Based on the Gauss-Markov theorem, the OLS estimator is the Best Linear Unbiased Estimator, meaning it has the lowest variance among all linear unbiased estimators.
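A sketch of the unbiased noise-variance estimate on synthetic data with known noise (sigma = 0.5, so sigma^2 = 0.25; data generation is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with known noise standard deviation 0.5.
N, sigma_true = 200, 0.5
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x + rng.normal(0, sigma_true, N)

Phi = np.column_stack([np.ones(N), x])   # M = 2 parameters
M = Phi.shape[1]
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Unbiased noise variance estimate: divide the RSS by N - M, not N.
residuals = t - Phi @ w_hat
sigma2_hat = np.sum(residuals**2) / (N - M)
print(sigma2_hat)  # close to sigma_true**2 = 0.25
```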
Stochastic Gradient Descent (SGD)
This is an iterative optimization algorithm that updates the weights incrementally using one sample at a time, rather than computing the gradient over the entire dataset.
The loss function can be written as a sum of per-sample losses, L(w) = \sum_{n=1}^N L_n(w), where L_n is the loss on the n-th sample. Each update uses the gradient of a single term:
w^{(k+1)} = w^{(k)} - \alpha^{(k)} \nabla L_n(w^{(k)})
For squared loss: w^{(k+1)} = w^{(k)} - \alpha^{(k)} (w^{(k)T} \phi(x_n) - t_n) \phi(x_n)
where:
- \alpha^{(k)}: learning rate at iteration k
- w^{(k)}: weight vector at iteration k
At each iteration, the algorithm performs the following steps:
- Compute prediction error: (w^T \phi(x_n) - t_n)
- Compute error gradient: Multiply by the feature vector
- Move weights in opposite direction with a step defined by the learning rate \alpha^{(k)}: w := w - \alpha^{(k)} \times (\text{error gradient})
This method is more efficient for large datasets, as it avoids the costly matrix inversion required by OLS and allows for online learning. However, each update is influenced by the noise of a single sample.
SGD converges to the OLS solution only when the learning rate decays over time and satisfies the Robbins-Monro conditions:
- \sum_{k=1}^\infty \alpha^{(k)} = \infty: ensures that the algorithm continues to make progress towards the minimum
- \sum_{k=1}^\infty (\alpha^{(k)})^2 < \infty: ensures that the steps become small enough to converge to the minimum without oscillating around it.
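A minimal sketch of SGD for the squared loss on synthetic data (the data and the rate schedule are illustrative); the decaying rate \alpha^{(k)} = 0.5/(1 + 0.01k) satisfies both Robbins-Monro conditions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: t = 1 + 2*x + noise.
N = 500
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, N)
Phi = np.column_stack([np.ones(N), x])

w = np.zeros(2)
k = 0
for epoch in range(20):
    for n in rng.permutation(N):         # visit samples in random order
        k += 1
        alpha = 0.5 / (1 + 0.01 * k)     # decaying rate (Robbins-Monro)
        error = w @ Phi[n] - t[n]        # prediction error on one sample
        w = w - alpha * error * Phi[n]   # step against the single-sample gradient
print(w)  # approaches the OLS solution, close to [1.0, 2.0]
```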
Geometric Interpretation
The OLS solution projects the target vector t onto the feature space spanned by columns of \Phi:
\hat{t} = \Phi \hat{w}_{OLS} = \underbrace{\Phi (\Phi^T \Phi)^{-1} \Phi^T}_{\text{Projection Matrix } P} t
The projection \hat{t} is the point in the column space of \Phi closest to t, so the residual t - \hat{t} is orthogonal to that space.
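A small numerical check of the projection-matrix properties (the design matrix and targets are random, hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(6, 2))                 # small design matrix, full column rank
t = rng.normal(size=6)

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T  # projection onto the column space
t_hat = P @ t

print(np.allclose(P @ P, P))                  # True: a projection is idempotent
print(np.allclose(Phi.T @ (t - t_hat), 0))    # True: residual orthogonal to the features
```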
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is a probabilistic method that finds the parameters under which the observed data is most likely. Since it models only the conditional distribution p(t|x), not the input distribution, it is a discriminative approach.
Assume targets are generated by a function summed with Gaussian noise: t = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
This approach maximizes the likelihood, which is the probability of observing the data given the parameters:
p(t|X, w, \sigma^2) = \prod_{n=1}^N \mathcal{N}(t_n | w^T \phi(x_n), \sigma^2)
Under the Gaussian noise assumption, maximizing the likelihood is equivalent to minimizing the sum of squared errors: the parameters that make the observed data most probable are exactly those that minimize the residuals.
To solve it, we take the logarithm of the likelihood (log-likelihood) to simplify the product into a sum: \ln p(t|X, w, \sigma^2) = \sum_{n=1}^N \ln \mathcal{N}(t_n | w^T \phi(x_n), \sigma^2) = - \frac{N}{2} \ln (2\pi \sigma^2) - \frac{1}{2\sigma^2} \text{RSS}(w)
Setting the derivative to zero:
\nabla_w \ln p(t|X, w, \sigma^2) = \frac{1}{\sigma^2} \Phi^T (t - \Phi w) = 0 \quad \Rightarrow \quad \hat{w}_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t
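A small numerical check on synthetic data (with an assumed known noise variance): the normal-equations solution maximizes the Gaussian log-likelihood, and any perturbation of it lowers the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x + rng.normal(0, 0.2, N)
Phi = np.column_stack([np.ones(N), x])

def log_likelihood(w, sigma2=0.04):
    """Gaussian log-likelihood: -N/2*log(2*pi*sigma2) - RSS/(2*sigma2)."""
    rss = np.sum((t - Phi @ w) ** 2)
    return -N / 2 * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # same formula as OLS

# Perturbing w_ml in any direction decreases the log-likelihood.
print(log_likelihood(w_ml) > log_likelihood(w_ml + np.array([0.1, 0.0])))  # True
print(log_likelihood(w_ml) > log_likelihood(w_ml - np.array([0.0, 0.1])))  # True
```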
Regularization
Regularization is a technique to prevent overfitting by adding a penalty term to the loss function that discourages complex models (too many parameters or weights too big).
Modified loss function:
L(w) = \underbrace{L_D(w)}_{\text{error on data}} + \lambda \underbrace{L_w(w)}_{\text{model complexity}}
For least squares:
L(w) = \underbrace{\frac{1}{2} \sum_{n=1}^N (t_n - w^T \phi(x_n))^2}_{\frac{1}{2}\text{RSS}} + \lambda L_w(w)
The regularization parameter \lambda controls the tradeoff:
- \lambda = 0: Only fit data (standard OLS)
- \lambda \to \infty: Weights forced to zero
Ridge Regression (L2 Regularization)
Penalize the L2 norm (sum of squares) of weights: L_w(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2} \sum_{j=1}^M w_j^2
This encourages smaller weights, but does not set them to zero.
Full loss function: L(w) = \frac{1}{2} \sum_{n=1}^N (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} \|w\|_2^2
The solution is: \hat{w}_{\text{ridge}} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t
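A sketch of the ridge closed form on hypothetical synthetic data, checking that a larger \lambda shrinks the weight norm:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 30
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, N)
Phi = np.column_stack([np.ones(N), x])

def ridge(Phi, t, lam):
    """Ridge closed form: w = (Phi^T Phi + lambda*I)^{-1} Phi^T t."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

w_ols = ridge(Phi, t, 0.0)    # lambda = 0 recovers OLS
w_reg = ridge(Phi, t, 10.0)
print(np.linalg.norm(w_reg) < np.linalg.norm(w_ols))  # True: regularization shrinks the weights
```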
Lasso Regression (L1 Regularization)
Penalize the L1 norm (sum of absolute values): L_w(w) = \|w\|_1 = \sum_{j=1}^M |w_j|
This generates sparse solutions, where some weights are exactly zero, effectively performing feature selection by removing irrelevant features from the model.
Full loss function: L(w) = \frac{1}{2} \sum_{n=1}^N (t_n - w^T \phi(x_n))^2 + \lambda \|w\|_1
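Lasso has no closed-form solution; one common solver is proximal gradient descent (ISTA). A minimal sketch on synthetic data (\lambda, the data, and the iteration budget are illustrative), showing that irrelevant weights are driven exactly to zero:

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of k*||.||_1: shrinks toward zero, producing exact zeros."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def lasso_ista(Phi, t, lam, n_iter=5000):
    """Proximal gradient (ISTA) for 0.5*||t - Phi w||^2 + lam*||w||_1."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1/L, L = largest eigenvalue of Phi^T Phi
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)           # gradient of the smooth (RSS) part
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(6)
N = 100
X = rng.normal(size=(N, 3))
t = 2.0 * X[:, 0] + rng.normal(0, 0.1, N)  # only the first feature is relevant
w = lasso_ista(X, t, lam=50.0)
print(w)  # irrelevant weights exactly zero; the first weight is shrunk below 2
```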
Bayesian Linear Regression
The Bayesian Linear Regression is a probabilistic approach to linear regression that incorporates uncertainty about the model parameters.
Before seeing data, there is an assumption about the distribution of weights, called the prior distribution. A common choice is a Gaussian prior: p(w) = \mathcal{N}(w | w_0, S_0)
After observing data, Bayes’ theorem is used to update the belief about the weights, resulting in the posterior distribution: \overbrace{p(w|\mathcal{D})}^{\text{posterior}} = \frac{\overbrace{p(\mathcal{D}|w)}^{\text{likelihood}} \, \overbrace{p(w)}^{\text{prior}}}{p(\mathcal{D})}
where:
- p(\mathcal{D}) = \int p(\mathcal{D}|w) p(w) dw: is the marginal likelihood, which normalizes the posterior distribution.
After updating, the weights still maintain a Gaussian distribution (conjugate prior): \overbrace{p(w|t, \Phi, \sigma^2)}^{\text{posterior}} \propto \overbrace{\mathcal{N}(w | w_0, S_0)}^{\text{prior}} \, \overbrace{\mathcal{N}(t | \Phi w, \sigma^2 I)}^{\text{likelihood}} = \mathcal{N}(w | w_N, S_N)
where:
- w_N = S_N \left( S_0^{-1} w_0 + \frac{\Phi^T t}{\sigma^2} \right)
- S_N^{-1} = S_0^{-1} + \frac{\Phi^T \Phi}{\sigma^2}
With a zero-mean Gaussian prior (w_0 = 0, S_0 = \tau^2 I), the posterior mean w_N is the MAP estimate and coincides with the ridge regression solution with \lambda = \frac{\sigma^2}{\tau^2}.
Predictive Distribution
The Bayesian approach makes it possible to compute the probability distribution of the target for a new input x, called the predictive distribution: p(t| x, \mathcal{D}, \sigma^2) = \int \mathcal{N}(t | w^T \phi(x), \sigma^2) \mathcal{N}(w | w_N, S_N) dw = \mathcal{N}(t | w_N^T \phi(x), \sigma^2_N)
where: \sigma^2_N = \underbrace{\sigma^2}_{\text{data noise}} + \underbrace{\phi(x)^T S_N \phi(x)}_{\text{parameters uncertainty}}
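A sketch of the posterior update and predictive variance with a zero-mean Gaussian prior (the data, noise level, and prior scale are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: t = 1 + 2*x + noise, with sigma^2 assumed known.
sigma2 = 0.1**2
N = 50
x = rng.uniform(-1, 1, N)
t = 1.0 + 2.0 * x + rng.normal(0, np.sqrt(sigma2), N)
Phi = np.column_stack([np.ones(N), x])

# Zero-mean Gaussian prior: w_0 = 0, S_0 = tau^2 * I.
tau2 = 1.0
S0_inv = np.eye(2) / tau2

# Posterior update: S_N^{-1} = S_0^{-1} + Phi^T Phi / sigma^2,
#                   w_N = S_N * Phi^T t / sigma^2 (since w_0 = 0).
SN_inv = S0_inv + Phi.T @ Phi / sigma2
SN = np.linalg.inv(SN_inv)
wN = SN @ (Phi.T @ t / sigma2)
print(wN)  # close to [1.0, 2.0]

# Predictive distribution at a new input x*.
phi_star = np.array([1.0, 0.5])
mean_star = wN @ phi_star
var_star = sigma2 + phi_star @ SN @ phi_star   # data noise + parameter uncertainty
print(mean_star, var_star)
```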